Update bos_token #4806
Conversation
Nice catch! Should we modify the function instead? For example:

```diff
 def _apply_chat_template_to_messages_list(self, messages_list: InputsType):
     prompts_text = []
     for messages in messages_list:
         InferRequest.remove_response(messages)
         template_inputs, _ = StdTemplateInputs.from_dict({'messages': messages})
         res_context_list, _, _ = self.template._swift_encode(template_inputs)
-        prompts_text.append(''.join(res_context_list))
+        prompts_text.append(''.join(elem for elem in res_context_list if isinstance(elem, str)))
     return prompts_text
```
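The suggested filtering can be illustrated with a minimal sketch (the data below is hypothetical, not swift's real encode output): `_swift_encode` may return a mix of strings and token-id lists, and `str.join` rejects the latter.

```python
# Hypothetical example of a mixed context list: a token-id list (e.g. a
# bos/prefix token with no plain-text form) followed by text fragments.
res_context_list = [[151643], 'Hello, ', 'world!']

# Joining the raw list fails, because str.join only accepts strings:
try:
    ''.join(res_context_list)
except TypeError as err:
    print(err)  # sequence item 0: expected str instance, list found

# The suggested change keeps only the string elements:
prompt_text = ''.join(e for e in res_context_list if isinstance(e, str))
print(prompt_text)  # Hello, world!
```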
@hjh0119 I have read the full `self.template._swift_encode(template_inputs)` function; the return result is a list. If an element of `res_context_list` is itself a list, that is not a correct result, so we should raise an exception here instead of ignoring the error.
If the elem type is list, that means the elem holds token-id values, not text values.
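A third option between dropping these elements and raising is to decode them back to text. The sketch below is a hypothetical helper, not swift's actual code; it only assumes the tokenizer exposes a `decode(ids)` method, as Hugging Face tokenizers do.

```python
# Hypothetical helper: join a mixed context list, decoding token-id lists
# back to text instead of silently dropping them.
def join_context_list(res_context_list, tokenizer):
    parts = []
    for elem in res_context_list:
        if isinstance(elem, str):
            parts.append(elem)
        elif isinstance(elem, list):
            # token ids, e.g. a bos/prefix token stored as ids rather than text
            parts.append(tokenizer.decode(elem))
        else:
            raise TypeError(f'unexpected element type: {type(elem)}')
    return ''.join(parts)
```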
I believe the related issue arises because of `template_meta.prefix`. Therefore, I think ignoring the non-string values in the joined result is acceptable. Are there any training scripts that fail to run on the main branch but work with this PR?
Yes, the related issue arises because of `template_meta.prefix`; I couldn't agree more. You can see my other PR, which fixes the `template_meta.prefix` bug: #4813.
So are there any training/infer scripts that fail to run on the main branch but work with this PR?
Just replace `your_own_path` in `--model` and `--output_dir`, then you can test.
I've tested the script you provided using your commit, but it still raises the following error: `TypeError: sequence item 0: expected str instance, list found`.
```python
# @@ -1039,8 +1039,16 @@ def _swift_encode(self, inputs: StdTemplateInputs):
                idx = all_tokens.index(single_token[0])
                bos_token = all_tokens[:idx]
                sep_token = all_tokens[idx + 1:]
                """
```
`auto_add_bos` is False for Qwen models, so the encode logic of Qwen models will not reach here.
This function will not run for Qwen models because the `QwenTemplateMeta` class in `qwen.py` sets `auto_add_bos = False`, but it will raise an exception for the DeepSeek model.
```python
# swift/llm/template/template/qwen.py
@dataclass
class QwenTemplateMeta(ChatmlTemplateMeta):
    default_system: Optional[str] = DEFAULT_SYSTEM
    auto_add_bos: bool = False
    stop_words: List[Word] = field(default_factory=lambda: ['<|endoftext|>'])
    agent_template: str = 'hermes'
```
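The effect of `auto_add_bos` can be shown with toy stand-ins for the template metadata (hypothetical classes, not swift's real `ChatmlTemplateMeta`/`QwenTemplateMeta`): the bos-handling branch is simply skipped when the flag is False, which is why Qwen models never reach the code path under discussion.

```python
from dataclasses import dataclass, field
from typing import List

# Toy stand-ins for swift's template metadata (hypothetical simplification).
@dataclass
class ToyTemplateMeta:
    auto_add_bos: bool = True
    stop_words: List[str] = field(default_factory=list)

@dataclass
class ToyQwenMeta(ToyTemplateMeta):
    auto_add_bos: bool = False  # Qwen opts out of automatic bos handling
    stop_words: List[str] = field(default_factory=lambda: ['<|endoftext|>'])

def maybe_add_bos(meta: ToyTemplateMeta, bos_token: str, prompt: str) -> str:
    # The bos branch only runs when auto_add_bos is True, so a Qwen-style
    # meta never triggers it.
    return bos_token + prompt if meta.auto_add_bos else prompt
```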
You should fix the #4806 bug first; I have committed a PR for this bug.
PR type
PR information
so we need to check whether `bos_token` is empty.
2. If `bos_token` is not empty, that means a special token needs to be added. `bos_token = all_tokens[:idx]` yields a list, so executing `prompts_text.append(''.join(res_context_list))` raises an error because the type of `res_context_list[0]` is list. Moreover, the `bos_token` here consists of token ids rather than a token string, which would also fail during encoding, so using `self.tokenizer.bos_token` is more appropriate.
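That reasoning can be sketched with a simplified stand-in for the encode step (hypothetical helper, not the actual `_swift_encode`): prepend the tokenizer's `bos_token` string only when it is defined and non-empty, so every element stays a `str` and the later `''.join(res_context_list)` cannot fail.

```python
# Hypothetical sketch: build a context list using the bos_token *string*
# (self.tokenizer.bos_token) instead of raw token ids.
def build_context(prompt, tokenizer):
    res_context_list = []
    bos = getattr(tokenizer, 'bos_token', None)
    if bos:  # None or '' means this model adds no bos token
        res_context_list.append(bos)  # a str, so joining stays safe
    res_context_list.append(prompt)
    return ''.join(res_context_list)
```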